Open main menu
首页
专栏
课程
分类
归档
Chat
Sci-Hub
谷歌学术
Libgen
GitHub镜像
登录/注册
搜索
关闭
Previous
Previous
Next
Next
OpenAI - tiktoken ⏳ | fast BPE tokeniser
sockstack
/
585
/
2024-02-27 00:02:38
<p><span style="color: red; font-size: 18px">ChatGPT 可用网址,仅供交流学习使用,如对您有所帮助,请收藏并推荐给需要的朋友。</span><br><a href="https://ckai.xyz/?sockstack§ion=detail" target="__blank">https://ckai.xyz</a><br><br></p> <article class="baidu_pl"><div id="article_content" class="article_content clearfix"> <link rel="stylesheet" href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/kdoc_html_views-1a98987dfd.css"> <link rel="stylesheet" href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/ck_htmledit_views-25cebea3f9.css"> <div id="content_views" class="markdown_views prism-atom-one-light"> <svg xmlns="http://www.w3.org/2000/svg" style="display: none;"><path stroke-linecap="round" d="M5,0 0,2.5 5,5z" id="raphael-marker-block" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);"></path></svg><p></p> <div class="toc"> <h3>文章目录</h3> <ul><li> <ul> <li>关于 ⏳ tiktoken</li> <li> <ul> <li>性能表现</li> <li>安装</li> <li>tiktoken 如何计算 token</li> <li>Encodings</li> <li>Tokenizer libraries 对不同编程语言的支持</li> <li>How strings are typically tokenized</li> </ul> </li> <li>使用</li> <li> <ul> <li>编解码</li> <li>比较 encodings</li> <li>计算chat API调用的tokens</li> <li>拓展 tiktoken</li> </ul> </li> </ul> </li></ul> </div> <p></p> <hr> <h2> <a id="__tiktoken_2"></a>关于 ⏳ tiktoken</h2> <p>tiktoken is a fast BPE tokeniser for use with OpenAI’s models.<br> 初看这个名字,以为是跟 tiktok 相关,没想到是 openai 下面的,这取名还真是有趣呢。</p> <ul> <li>github https://github.com/openai/tiktoken</li> <li>openai-cookbook / examples / How_to_count_tokens_with_tiktoken.ipynb<br> https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb</li> </ul> <hr> <h3> <a id="_14"></a>性能表现</h3> <p>tiktoken 比其他开源 tokeniser 快 3-6 倍<br> 基于 1GB 文本进行测试,使用 GPT-2 tokeniser,使用 <code>GPT2TokenizerFast</code> from <code>tokenizers==0.13.2</code>, <code>transformers==4.24.0</code> and <code>tiktoken==0.2.0</code>。</p> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/a852581e177340d389d8e0c0147a7bbb.png" alt="在这里插入图片描述"></p> <hr> <h3> <a id="_22"></a>安装</h3> <pre><code class="prism language-shell">pip <span class="token function">install</span> tiktoken </code></pre> <hr> <h3> <a id="tiktoken__token_30"></a>tiktoken 如何计算 token</h3> <p>给定一个文本字符:<code>"tiktoken is great!"</code>,和一个 encoding,比如 <code>"cl100k_base"</code>。<br> 一个 tokenizer 可以讲文本字符串分割成一系列 tokens,如: <code>["t", "ik", "token", " is", " great", "!"]</code></p> <p>GPT 模型使用这种类型的 token。<br> 知道文本字符串中有多少令牌,可以告诉你(a)字符串是否太长,文本模型无法处理,以及(b)OpenAI API调用的成本(因为使用是按令牌定价的)。</p> <hr> <h3> <a id="Encodings_41"></a>Encodings</h3> <p>编码指定如何将文本转换为标记。不同的模型使用不同的编码。</p> <p>OpenAI models 使用 <code>tiktoken</code> 支持下面三种编码:</p> <table> <thead><tr> <th>Encoding name</th> <th>OpenAI models</th> </tr></thead> <tbody> <tr> <td><code>cl100k_base</code></td> <td> <code>gpt-4</code>, <code>gpt-3.5-turbo</code>, <code>text-embedding-ada-002</code> </td> </tr> <tr> <td><code>p50k_base</code></td> <td>Codex models, <code>text-davinci-002</code>, <code>text-davinci-003</code> </td> </tr> <tr> <td> <code>r50k_base</code> (or <code>gpt2</code>)</td> <td>GPT-3 models like <code>davinci</code> </td> </tr> </tbody> </table> <p>您可以获取一个模型的编码 ,使用 <code>tiktoken.encoding_for_model()</code> 如下:</p> <pre><code class="prism language-python">encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>encoding_for_model<span class="token punctuation">(</span><span class="token string">'gpt-3.5-turbo'</span><span class="token punctuation">)</span> </code></pre> <p>注意,<code>p50k_base</code> 与 <code>r50k_base</code> 基本类似,对于非代码应用程序,它们通常会给出相同的令牌。</p> <hr> <h3> <a id="Tokenizer_libraries__63"></a>Tokenizer libraries 对不同编程语言的支持</h3> <p>对于 <code>cl100k_base</code> 和 <code>p50k_base</code> encodings:</p> <ul> <li>Python: tiktoken</li> <li>.NET / C#: SharpToken</li> </ul> <hr> <p>对于 <code>r50k_base</code> (<code>gpt2</code>) encodings, tokenizers are available in many languages.</p> <ul> <li>Python: tiktoken (or alternatively GPT2TokenizerFast)</li> <li>JavaScript: gpt-3-encoder</li> <li>.NET / C#: GPT Tokenizer</li> <li>Java: gpt2-tokenizer-java</li> <li>PHP: GPT-3-Encoder-PHP</li> </ul> <p>(OpenAI不对第三方库进行背书或保证。)</p> <hr> <h3> <a id="How_strings_are_typically_tokenized_84"></a>How strings are typically tokenized</h3> <p>In English, tokens commonly range in length from one character to one word (e.g., <code>"t"</code> or <code>" great"</code>), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., <code>" is"</code> instead of <code>"is "</code> or <code>" "</code>+<code>"is"</code>). You can quickly check how a string is tokenized at the OpenAI Tokenizer.</p> <p>在英语中,tokens的长度通常从一个字符到一个单词(例如,<code>t</code> 或 <code>great</code> ),尽管在一些语言中,tokens 可以短于一个字符或长于一个单词。<br> 空格通常以单词的开头分组(例如,<code> is</code> 而不是 <code>is</code> 或 <code> </code>+ <code>is</code>。<br> 您可以在[OpenAI Tokenizer]快速检查字符串是如何tokenize的。</p> <p>OpenAI Tokenizer : https://beta.openai.com/tokenizer</p> <hr> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/e81f96dcb632484e8d1d463b4c0d5145.png" alt="在这里插入图片描述" width="500"></p> <hr> <h2> <a id="_100"></a>使用</h2> <h3> <a id="_102"></a>编解码</h3> <pre><code class="prism language-python"><span class="token keyword">import</span> tiktoken </code></pre> <pre><code class="prism language-python"><span class="token comment"># 使用名字加载 encoding</span> <span class="token comment"># 第一次运行时,可能需要连接互联网来下载;下一次不需要联网</span> encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>get_encoding<span class="token punctuation">(</span><span class="token string">"cl100k_base"</span><span class="token punctuation">)</span><span class="token comment"># 对于给定的模型名,自动加载正确的 encoding </span> encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>encoding_for_model<span class="token punctuation">(</span><span class="token string">"gpt-3.5-turbo"</span><span class="token punctuation">)</span><span class="token comment"># 将文本转化为 tokens 列表</span> encoding<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"tiktoken is great!"</span><span class="token punctuation">)</span> <span class="token comment"># [83, 1609, 5963, 374, 2294, 0]</span> </code></pre> <pre><code class="prism language-python"><span class="token comment"># 计算 encode 返回列表的长度</span> <span class="token keyword">def</span> <span class="token function">num_tokens_from_string</span><span class="token punctuation">(</span>string<span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">,</span> encoding_name<span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token builtin">int</span><span class="token punctuation">:</span><span class="token triple-quoted-string string">"""Returns the number of tokens in a text string."""</span>encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>get_encoding<span class="token punctuation">(</span>encoding_name<span class="token punctuation">)</span>num_tokens <span class="token operator">=</span> <span class="token builtin">len</span><span class="token punctuation">(</span>encoding<span class="token punctuation">.</span>encode<span class="token punctuation">(</span>string<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token keyword">return</span> num_tokens </code></pre> <pre><code class="prism language-python">num_tokens_from_string<span class="token punctuation">(</span><span class="token string">"tiktoken is great!"</span><span class="token punctuation">,</span> <span class="token string">"cl100k_base"</span><span class="token punctuation">)</span> <span class="token comment"># 6</span> </code></pre> <hr> <pre><code class="prism language-python"><span class="token comment"># 将 tokens 转化为 文本</span> encoding<span class="token punctuation">.</span>decode<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">83</span><span class="token punctuation">,</span> <span class="token number">1609</span><span class="token punctuation">,</span> <span class="token number">5963</span><span class="token punctuation">,</span> <span class="token number">374</span><span class="token punctuation">,</span> <span class="token number">2294</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token comment"># 'tiktoken is great!'</span> </code></pre> <hr> <p>警告:尽管 <code>.decode()</code> 可以应用于单个令牌,但要注意,对于不在utf-8边界上的令牌,它可能会有损耗。<br> 对于单个 tokens,<code>.decode_single_token_bytes()</code> 方法安全地将单个整数令牌转换为它所代表的字节。</p> <pre><code class="prism language-python"><span class="token punctuation">[</span>encoding<span class="token punctuation">.</span>decode_single_token_bytes<span class="token punctuation">(</span>token<span class="token punctuation">)</span> <span class="token keyword">for</span> token <span class="token keyword">in</span> <span class="token punctuation">[</span><span class="token number">83</span><span class="token punctuation">,</span> <span class="token number">1609</span><span class="token punctuation">,</span> <span class="token number">5963</span><span class="token punctuation">,</span> <span class="token number">374</span><span class="token punctuation">,</span> <span class="token number">2294</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">]</span> <span class="token comment"># [b't', b'ik', b'token', b' is', b' great', b'!']</span> </code></pre> <p>(字符串前面的 <code>b</code> 表示字符串是字节字符串。)</p> <hr> <h3> <a id="_encodings_161"></a>比较 encodings</h3> <p>不同的编码在拆分单词、组空格和处理非英语字符的方式上各不相同。使用上面的方法,我们可以比较几个示例字符串的不同编码。</p> <pre><code class="prism language-python"><span class="token keyword">def</span> <span class="token function">compare_encodings</span><span class="token punctuation">(</span>example_string<span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span><span class="token triple-quoted-string string">"""Prints a comparison of three string encodings."""</span><span class="token comment"># print the example string</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f'\nExample string: "</span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>example_string<span class="token punctuation">}</span></span><span class="token string">"'</span></span><span class="token punctuation">)</span><span class="token comment"># for each encoding, print the # of tokens, the token integers, and the token bytes</span><span class="token keyword">for</span> encoding_name <span class="token keyword">in</span> <span class="token punctuation">[</span><span class="token string">"gpt2"</span><span class="token punctuation">,</span> <span class="token string">"p50k_base"</span><span class="token punctuation">,</span> <span class="token string">"cl100k_base"</span><span class="token punctuation">]</span><span class="token punctuation">:</span>encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>get_encoding<span class="token punctuation">(</span>encoding_name<span class="token punctuation">)</span>token_integers <span class="token operator">=</span> encoding<span class="token punctuation">.</span>encode<span class="token punctuation">(</span>example_string<span class="token punctuation">)</span>num_tokens <span class="token operator">=</span> <span class="token builtin">len</span><span class="token punctuation">(</span>token_integers<span class="token punctuation">)</span>token_bytes <span class="token operator">=</span> <span class="token punctuation">[</span>encoding<span class="token punctuation">.</span>decode_single_token_bytes<span class="token punctuation">(</span>token<span class="token punctuation">)</span> <span class="token keyword">for</span> token <span class="token keyword">in</span> token_integers<span class="token punctuation">]</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>encoding_name<span class="token punctuation">}</span></span><span class="token string">: </span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>num_tokens<span class="token punctuation">}</span></span><span class="token string"> tokens"</span></span><span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"token integers: </span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>token_integers<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"token bytes: </span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>token_bytes<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span></code></pre> <hr> <pre><code class="prism language-python">compare_encodings<span class="token punctuation">(</span><span class="token string">"antidisestablishmentarianism"</span><span class="token punctuation">)</span> </code></pre> <pre><code class="prism language-python">Example string<span class="token punctuation">:</span> <span class="token string">"antidisestablishmentarianism"</span>gpt2<span class="token punctuation">:</span> <span class="token number">5</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">415</span><span class="token punctuation">,</span> <span class="token number">29207</span><span class="token punctuation">,</span> <span class="token number">44390</span><span class="token punctuation">,</span> <span class="token number">3699</span><span class="token punctuation">,</span> <span class="token number">1042</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'ant'</span><span class="token punctuation">,</span> <span class="token string">b'idis'</span><span class="token punctuation">,</span> <span class="token string">b'establishment'</span><span class="token punctuation">,</span> <span class="token string">b'arian'</span><span class="token punctuation">,</span> <span class="token string">b'ism'</span><span class="token punctuation">]</span>p50k_base<span class="token punctuation">:</span> <span class="token number">5</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">415</span><span class="token punctuation">,</span> <span class="token number">29207</span><span class="token punctuation">,</span> <span class="token number">44390</span><span class="token punctuation">,</span> <span class="token number">3699</span><span class="token punctuation">,</span> <span class="token number">1042</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'ant'</span><span class="token punctuation">,</span> <span class="token string">b'idis'</span><span class="token punctuation">,</span> <span class="token string">b'establishment'</span><span class="token punctuation">,</span> <span class="token string">b'arian'</span><span class="token punctuation">,</span> <span class="token string">b'ism'</span><span class="token punctuation">]</span>cl100k_base<span class="token punctuation">:</span> <span class="token number">6</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">519</span><span class="token punctuation">,</span> <span class="token number">85342</span><span class="token punctuation">,</span> <span class="token number">34500</span><span class="token punctuation">,</span> <span class="token number">479</span><span class="token punctuation">,</span> <span class="token number">8997</span><span class="token punctuation">,</span> <span class="token number">2191</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'ant'</span><span class="token punctuation">,</span> <span class="token string">b'idis'</span><span class="token punctuation">,</span> <span class="token string">b'establish'</span><span class="token punctuation">,</span> <span class="token string">b'ment'</span><span class="token punctuation">,</span> <span class="token string">b'arian'</span><span class="token punctuation">,</span> <span class="token string">b'ism'</span><span class="token punctuation">]</span> </code></pre> <hr> <pre><code class="prism language-python">compare_encodings<span class="token punctuation">(</span><span class="token string">"2 + 2 = 4"</span><span class="token punctuation">)</span> </code></pre> <pre><code class="prism language-python">Example string<span class="token punctuation">:</span> <span class="token string">"2 + 2 = 4"</span>gpt2<span class="token punctuation">:</span> <span class="token number">5</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">1343</span><span class="token punctuation">,</span> <span class="token number">362</span><span class="token punctuation">,</span> <span class="token number">796</span><span class="token punctuation">,</span> <span class="token number">604</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'2'</span><span class="token punctuation">,</span> <span class="token string">b' +'</span><span class="token punctuation">,</span> <span class="token string">b' 2'</span><span class="token punctuation">,</span> <span class="token string">b' ='</span><span class="token punctuation">,</span> <span class="token string">b' 4'</span><span class="token punctuation">]</span>p50k_base<span class="token punctuation">:</span> <span class="token number">5</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">1343</span><span class="token punctuation">,</span> <span class="token number">362</span><span class="token punctuation">,</span> <span class="token number">796</span><span class="token punctuation">,</span> <span class="token number">604</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'2'</span><span class="token punctuation">,</span> <span class="token string">b' +'</span><span class="token punctuation">,</span> <span class="token string">b' 2'</span><span class="token punctuation">,</span> <span class="token string">b' ='</span><span class="token punctuation">,</span> <span class="token string">b' 4'</span><span class="token punctuation">]</span>cl100k_base<span class="token punctuation">:</span> <span class="token number">7</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">489</span><span class="token punctuation">,</span> <span class="token number">220</span><span class="token punctuation">,</span> <span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">284</span><span class="token punctuation">,</span> <span class="token number">220</span><span class="token punctuation">,</span> <span class="token number">19</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'2'</span><span class="token punctuation">,</span> <span class="token string">b' +'</span><span class="token punctuation">,</span> <span class="token string">b' '</span><span class="token punctuation">,</span> <span class="token string">b'2'</span><span class="token punctuation">,</span> <span class="token string">b' ='</span><span class="token punctuation">,</span> <span class="token string">b' '</span><span class="token punctuation">,</span> <span class="token string">b'4'</span><span class="token punctuation">]</span> </code></pre> <hr> <pre><code class="prism language-python">compare_encodings<span class="token punctuation">(</span><span class="token string">"お誕生日おめでとう"</span><span class="token punctuation">)</span> </code></pre> <pre><code class="prism language-python">Example string<span class="token punctuation">:</span> <span class="token string">"お誕生日おめでとう"</span>gpt2<span class="token punctuation">:</span> <span class="token number">14</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">2515</span><span class="token punctuation">,</span> <span class="token number">232</span><span class="token punctuation">,</span> <span class="token number">45739</span><span class="token punctuation">,</span> <span class="token number">243</span><span class="token punctuation">,</span> <span class="token number">37955</span><span class="token punctuation">,</span> <span class="token number">33768</span><span class="token punctuation">,</span> <span class="token number">98</span><span class="token punctuation">,</span> <span class="token number">2515</span><span class="token punctuation">,</span> <span class="token number">232</span><span class="token punctuation">,</span> <span class="token number">1792</span><span class="token punctuation">,</span> <span class="token number">223</span><span class="token punctuation">,</span> <span class="token number">30640</span><span class="token punctuation">,</span> <span class="token number">30201</span><span class="token punctuation">,</span> <span class="token number">29557</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'\xe3\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe8\xaa'</span><span class="token punctuation">,</span> <span class="token string">b'\x95'</span><span class="token punctuation">,</span> <span class="token string">b'\xe7\x94\x9f'</span><span class="token punctuation">,</span> <span class="token string">b'\xe6\x97'</span><span class="token punctuation">,</span> <span class="token string">b'\xa5'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x82'</span><span class="token punctuation">,</span> <span class="token string">b'\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa7'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa8'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\x86'</span><span class="token punctuation">]</span>p50k_base<span class="token punctuation">:</span> <span class="token number">14</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">2515</span><span class="token punctuation">,</span> <span class="token number">232</span><span class="token punctuation">,</span> <span class="token number">45739</span><span class="token punctuation">,</span> <span class="token number">243</span><span class="token punctuation">,</span> <span class="token number">37955</span><span class="token punctuation">,</span> <span class="token number">33768</span><span class="token punctuation">,</span> <span class="token number">98</span><span class="token punctuation">,</span> <span class="token number">2515</span><span class="token punctuation">,</span> <span class="token number">232</span><span class="token punctuation">,</span> <span class="token number">1792</span><span class="token punctuation">,</span> <span class="token number">223</span><span class="token punctuation">,</span> <span class="token number">30640</span><span class="token punctuation">,</span> <span class="token number">30201</span><span class="token punctuation">,</span> <span class="token number">29557</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'\xe3\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe8\xaa'</span><span class="token punctuation">,</span> <span class="token string">b'\x95'</span><span class="token punctuation">,</span> <span class="token string">b'\xe7\x94\x9f'</span><span class="token punctuation">,</span> <span class="token string">b'\xe6\x97'</span><span class="token punctuation">,</span> <span class="token string">b'\xa5'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x82'</span><span class="token punctuation">,</span> <span class="token string">b'\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa7'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa8'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\x86'</span><span class="token punctuation">]</span>cl100k_base<span class="token punctuation">:</span> <span class="token number">9</span> tokens token integers<span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">33334</span><span class="token punctuation">,</span> <span class="token number">45918</span><span class="token punctuation">,</span> <span class="token number">243</span><span class="token punctuation">,</span> <span class="token number">21990</span><span class="token punctuation">,</span> <span class="token number">9080</span><span class="token punctuation">,</span> <span class="token number">33334</span><span class="token punctuation">,</span> <span class="token number">62004</span><span class="token punctuation">,</span> <span class="token number">16556</span><span class="token punctuation">,</span> <span class="token number">78699</span><span class="token punctuation">]</span> token <span class="token builtin">bytes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">b'\xe3\x81\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe8\xaa'</span><span class="token punctuation">,</span> <span class="token string">b'\x95'</span><span class="token punctuation">,</span> <span class="token string">b'\xe7\x94\x9f'</span><span class="token punctuation">,</span> <span class="token string">b'\xe6\x97\xa5'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\x8a'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x82\x81'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa7'</span><span class="token punctuation">,</span> <span class="token string">b'\xe3\x81\xa8\xe3\x81\x86'</span><span class="token punctuation">]</span> </code></pre> <hr> <h3> <a id="chat_APItokens_268"></a>计算chat API调用的tokens</h3> <p>ChatGPT models like <code>gpt-3.5-turbo</code> and <code>gpt-4</code> use tokens in the same way as older completions models, but because of their message-based formatting, it’s more difficult to count how many tokens will be used by a conversation.</p> <p>Below is an example function for counting tokens for messages passed to <code>gpt-3.5-turbo-0301</code> or <code>gpt-4-0314</code>.</p> <p>Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee.</p> <p>像 <code>gpt-3.5-turbo</code> 和 <code>gpt-4</code> 这样的ChatGPT模型使用tokens 的方式与旧的完成模型相同,但由于它们基于消息的格式,很难计算会话将使用多少tokens。<br> 下面是一个示例函数,用于对传递到 <code>gpt-3.5-turbo-0301</code> 或 <code>gpt-4-0314</code> 的消息的tokens进行计数。<br> 请注意,从消息中计算tokens的确切方式可能会因模型而异。将函数中的计数视为一个估计值,而不是一个永恒的保证。</p> <hr> <pre><code class="prism language-python"><span class="token keyword">def</span> <span class="token function">num_tokens_from_messages</span><span class="token punctuation">(</span>messages<span class="token punctuation">,</span> model<span class="token operator">=</span><span class="token string">"gpt-3.5-turbo-0301"</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token triple-quoted-string string">"""Returns the number of tokens used by a list of messages."""</span><span class="token keyword">try</span><span class="token punctuation">:</span>encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>encoding_for_model<span class="token punctuation">(</span>model<span class="token punctuation">)</span><span class="token keyword">except</span> KeyError<span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Warning: model not found. Using cl100k_base encoding."</span><span class="token punctuation">)</span>encoding <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>get_encoding<span class="token punctuation">(</span><span class="token string">"cl100k_base"</span><span class="token punctuation">)</span><span class="token keyword">if</span> model <span class="token operator">==</span> <span class="token string">"gpt-3.5-turbo"</span><span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301."</span><span class="token punctuation">)</span><span class="token keyword">return</span> num_tokens_from_messages<span class="token punctuation">(</span>messages<span class="token punctuation">,</span> model<span class="token operator">=</span><span class="token string">"gpt-3.5-turbo-0301"</span><span class="token punctuation">)</span><span class="token keyword">elif</span> model <span class="token operator">==</span> <span class="token string">"gpt-4"</span><span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314."</span><span class="token punctuation">)</span><span class="token keyword">return</span> num_tokens_from_messages<span class="token punctuation">(</span>messages<span class="token punctuation">,</span> model<span class="token operator">=</span><span class="token string">"gpt-4-0314"</span><span class="token punctuation">)</span><span class="token keyword">elif</span> model <span class="token operator">==</span> <span class="token string">"gpt-3.5-turbo-0301"</span><span class="token punctuation">:</span>tokens_per_message <span class="token operator">=</span> <span class="token number">4</span> <span class="token comment"># every message follows <|start|>{role/name}\n{content}<|end|>\n</span>tokens_per_name <span class="token operator">=</span> <span class="token operator">-</span><span class="token number">1</span> <span class="token comment"># if there's a name, the role is omitted</span><span class="token keyword">elif</span> model <span class="token operator">==</span> <span class="token string">"gpt-4-0314"</span><span class="token punctuation">:</span>tokens_per_message <span class="token operator">=</span> <span class="token number">3</span>tokens_per_name <span class="token operator">=</span> <span class="token number">1</span><span class="token keyword">else</span><span class="token punctuation">:</span><span class="token keyword">raise</span> NotImplementedError<span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"""num_tokens_from_messages() is not implemented for model </span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>model<span class="token punctuation">}</span></span><span class="token string">. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""</span></span><span class="token punctuation">)</span>num_tokens <span class="token operator">=</span> <span class="token number">0</span><span class="token keyword">for</span> message <span class="token keyword">in</span> messages<span class="token punctuation">:</span>num_tokens <span class="token operator">+=</span> tokens_per_message<span class="token keyword">for</span> key<span class="token punctuation">,</span> value <span class="token keyword">in</span> message<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>num_tokens <span class="token operator">+=</span> <span class="token builtin">len</span><span class="token punctuation">(</span>encoding<span class="token punctuation">.</span>encode<span class="token punctuation">(</span>value<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token keyword">if</span> key <span class="token operator">==</span> <span class="token string">"name"</span><span class="token punctuation">:</span>num_tokens <span class="token operator">+=</span> tokens_per_namenum_tokens <span class="token operator">+=</span> <span class="token number">3</span> <span class="token comment"># every reply is primed with <|start|>assistant<|message|></span><span class="token keyword">return</span> num_tokens </code></pre> <hr> <pre><code class="prism language-python"><span class="token comment"># let's verify the function above matches the OpenAI API response</span><span class="token keyword">import</span> openaiexample_messages <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"system"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"You are a helpful, pattern-following assistant that translates corporate jargon into plain English."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"system"</span><span class="token punctuation">,</span><span class="token string">"name"</span><span class="token punctuation">:</span> <span class="token string">"example_user"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"New synergies will help drive top-line growth."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"system"</span><span class="token punctuation">,</span><span class="token string">"name"</span><span class="token punctuation">:</span> <span class="token string">"example_assistant"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"Things working well together will increase revenue."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"system"</span><span class="token punctuation">,</span><span class="token string">"name"</span><span class="token punctuation">:</span> <span class="token string">"example_user"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"system"</span><span class="token punctuation">,</span><span class="token string">"name"</span><span class="token punctuation">:</span> <span class="token string">"example_assistant"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"Let's talk later when we're less busy about how to do better."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token punctuation">{<!-- --></span><span class="token string">"role"</span><span class="token punctuation">:</span> <span class="token string">"user"</span><span class="token punctuation">,</span><span class="token string">"content"</span><span class="token punctuation">:</span> <span class="token string">"This late pivot means we don't have time to boil the ocean for the client deliverable."</span><span class="token punctuation">,</span><span class="token punctuation">}</span><span class="token punctuation">,</span> <span class="token punctuation">]</span><span class="token keyword">for</span> model <span class="token keyword">in</span> <span class="token punctuation">[</span><span class="token string">"gpt-3.5-turbo-0301"</span><span class="token punctuation">,</span> <span class="token string">"gpt-4-0314"</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">)</span><span class="token comment"># example token count from the function defined above</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>num_tokens_from_messages<span class="token punctuation">(</span>example_messages<span class="token punctuation">,</span> model<span class="token punctuation">)</span><span class="token punctuation">}</span></span><span class="token string"> prompt tokens counted by num_tokens_from_messages()."</span></span><span class="token punctuation">)</span><span class="token comment"># example token count from the OpenAI API</span>response <span class="token operator">=</span> openai<span class="token punctuation">.</span>ChatCompletion<span class="token punctuation">.</span>create<span class="token punctuation">(</span>model<span class="token operator">=</span>model<span class="token punctuation">,</span>messages<span class="token operator">=</span>example_messages<span class="token punctuation">,</span>temperature<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">,</span>max_tokens<span class="token operator">=</span><span class="token number">1</span> <span class="token comment"># we're only counting input tokens here, so let's not waste tokens on the output</span><span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f'</span><span class="token interpolation"><span class="token punctuation">{<!-- --></span>response<span class="token punctuation">[</span><span class="token string">"usage"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"prompt_tokens"</span><span class="token punctuation">]</span><span class="token punctuation">}</span></span><span class="token string"> prompt tokens counted by the OpenAI API.'</span></span><span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre> <hr> <pre><code class="prism language-python">gpt<span class="token operator">-</span><span class="token number">3.5</span><span class="token operator">-</span>turbo<span class="token operator">-</span><span class="token number">0301</span> <span class="token number">127</span> prompt tokens counted by num_tokens_from_messages<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span> <span class="token number">127</span> prompt tokens counted by the OpenAI API<span class="token punctuation">.</span>gpt<span class="token operator">-</span><span class="token number">4</span><span class="token operator">-</span><span class="token number">0314</span> <span class="token number">129</span> prompt tokens counted by num_tokens_from_messages<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span> <span class="token number">129</span> prompt tokens counted by the OpenAI API<span class="token punctuation">.</span> </code></pre> <hr> <h3> <a id="_tiktoken_388"></a>拓展 tiktoken</h3> <p>您可能希望扩展 tiktoken 以支持新的编码。有两种方法可以做到这一点。<br> 按照您想要的方式创建Encoding对象,然后简单地传递它。</p> <p>方式一:</p> <pre><code class="prism language-python">cl100k_base <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>get_encoding<span class="token punctuation">(</span><span class="token string">"cl100k_base"</span><span class="token punctuation">)</span><span class="token comment"># In production, load the arguments directly instead of accessing private attributes</span> <span class="token comment"># See openai_public.py for examples of arguments for specific encodings</span> enc <span class="token operator">=</span> tiktoken<span class="token punctuation">.</span>Encoding<span class="token punctuation">(</span><span class="token comment"># If you're changing the set of special tokens, make sure to use a different name</span><span class="token comment"># It should be clear from the name what behaviour to expect.</span>name<span class="token operator">=</span><span class="token string">"cl100k_im"</span><span class="token punctuation">,</span>pat_str<span class="token operator">=</span>cl100k_base<span class="token punctuation">.</span>_pat_str<span class="token punctuation">,</span>mergeable_ranks<span class="token operator">=</span>cl100k_base<span class="token punctuation">.</span>_mergeable_ranks<span class="token punctuation">,</span>special_tokens<span class="token operator">=</span><span class="token punctuation">{<!-- --></span><span class="token operator">**</span>cl100k_base<span class="token punctuation">.</span>_special_tokens<span class="token punctuation">,</span><span class="token string">"<|im_start|>"</span><span class="token punctuation">:</span> <span class="token number">100264</span><span class="token punctuation">,</span><span class="token string">"<|im_end|>"</span><span class="token punctuation">:</span> <span class="token number">100265</span><span class="token punctuation">,</span><span class="token punctuation">}</span> <span class="token punctuation">)</span> </code></pre> <hr> <p>方式二:<br> 使用 tiktoken_ext 插件机制 向tiktoken注册Encoding对象。<br> 只有当您需要 <code>tiktoken.get_encoding</code> 来查找您的编码时,这才有用,否则更适合上面方式1。<br> 要做到这一点,您需要在 <code>tiktoken_ext</code> 下创建一个命名空间包。<br> 这样布局你的项目,确保省略 <code>tiktoken_ext/__init__.py</code>文件:</p> <pre><code class="prism language-python">my_tiktoken_extension ├── tiktoken_ext │ └── my_encodings<span class="token punctuation">.</span>py └── setup<span class="token punctuation">.</span>py </code></pre> <hr> <p><code>my_encodings.py</code> 应该是一个包含名为 <code>ENCODING_CONSTRUCTORS</code> 的变量的模块。<br> 这是一个从编码名称到函数的字典,该函数不接受参数,并返回可以传递给 tiktoken.encoding 的参数来构造该编码。<br> 例如,请参阅 <code>tiktoken_ext/openai_public.py</code>。有关详细信息,请参阅 <code>tiktoken/registry.py</code> 。<br> 你的setup.py 应该是这样的:</p> <pre><code class="prism language-python"><span class="token keyword">from</span> setuptools <span class="token keyword">import</span> setup<span class="token punctuation">,</span> find_namespace_packagessetup<span class="token punctuation">(</span>name<span class="token operator">=</span><span class="token string">"my_tiktoken_extension"</span><span class="token punctuation">,</span>packages<span class="token operator">=</span>find_namespace_packages<span class="token punctuation">(</span>include<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">'tiktoken_ext*'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span>install_requires<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"tiktoken"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token punctuation">)</span> </code></pre> <p>然后简单地执行 <code>pip install ./my_tiktoken_extension</code>,您应该能够使用自定义编码!请确保不要使用可编辑安装。</p> <hr> <p>2023-03-31(五)</p> </div> <link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/markdown_views-98b95bb57c.css" rel="stylesheet"> <link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/style-c216769e99.css" rel="stylesheet"> </div> <div id="treeSkill"></div></article>
OpenAI - tiktoken ⏳ | fast BPE tokeniser
作者
sockstack
许可协议
CC BY 4.0
发布于
2024-02-27
修改于
2025-04-15
上一篇:软件:常用 Linux 软件汇总,值得收藏
下一篇:大规模语言模型微调技术——Instruction和Question的区别和联系
尚未登录
登录 / 注册
文章分类
博客重构之路
5
Spring Boot简单入门
4
k8s 入门教程
0
MySQL 知识
1
NSQ 消息队列
0
ThinkPHP5 源码分析
5
使用 Docker 从零开始搭建私人代码仓库
3
日常开发汇总
4
标签列表
springboot
hyperf
swoole
webman
php
多线程
数据结构
docker
k8s
thinkphp
mysql
tailwindcss
flowbite
css
前端